title

Created by:

  1. The Beginning
    1. Important imports
    2. Original data
    3. Selecting correlated features
  2. Precise analysis of important attributes
    1. Parallel Coordinates Plot
    2. Heat map
    3. The influence of markers on the outcome
    4. How are the markers correlated?
  3. Other features
    1. Facet Scatter plot
    2. Normal Scatter plot
    3. Violin plot
    4. Checking hypothesis
  4. Summary

The Beginning


Important imports


Original data

We start by reading the data and transforming it into something useful. We decided to group the records by patient ID and to use the mean of each feature as the patient's new 'main' value.
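The grouping step described above can be sketched as follows. This is a minimal illustration, assuming pandas; the column names (`PATIENT_ID`, `hs_crp`) and the toy values are assumptions made for this example, not the report's actual data.

```python
import pandas as pd

# Toy stand-in for the raw table: several blood samples per patient.
# Column names and values are assumptions made for this sketch.
raw = pd.DataFrame({
    "PATIENT_ID": [1, 1, 2, 2, 2],
    "age": [60, 60, 45, 45, 45],
    "outcome": [1, 1, 0, 0, 0],
    "hs_crp": [80.0, 120.0, 4.0, None, 6.0],
})

# One row per patient: the mean of every numeric column.
# Missing measurements (NaN) are skipped by default.
per_patient = raw.groupby("PATIENT_ID").mean(numeric_only=True)
print(per_patient)
```

Averaging keeps one row per patient, at the cost of hiding how a value changed over the course of the hospital stay.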


Selecting correlated features

At the beginning of our adventure, we have to decide which features are the most interesting. We select the features most correlated with the outcome: those whose correlation with the outcome is higher than 0.5.

From the "Most correlated functions" bar chart, we can easily read off the blood features most strongly related to the patient's death.

The percentages of lymphocytes and neutrophils have the highest correlation. The percentages of monocytes and eosinophils also appear, at the bottom of the list. However, neutrophils are the only white cell type whose absolute count appears on the chart. This might suggest that neutrophils are the most important white cell type and that the correlations of the other white cells are only incidental: the percentages sum to a whole, so a rise in the neutrophil share alone lowers the share of every other type.

It is worth noting that age appears on the list, which suggests that some age groups are more vulnerable.
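The selection rule above can be sketched like this. It is a minimal example, assuming pandas and assuming that the 0.5 threshold is applied to the absolute value of the correlation (so that strongly negative features are kept as well); the feature names and values are placeholders.

```python
import pandas as pd

# Toy per-patient table; feature names and values are hypothetical.
per_patient = pd.DataFrame({
    "outcome":   [0, 0, 0, 1, 1, 1],
    "feature_a": [1.0, 2.0, 1.5, 8.0, 9.0, 7.5],  # rises with outcome
    "feature_b": [9.0, 8.0, 8.5, 2.0, 1.0, 2.5],  # falls with outcome
    "feature_c": [5.0, 1.0, 3.0, 3.0, 1.0, 5.0],  # unrelated to outcome
})

# Pearson correlation of every feature with the outcome column.
corr_with_outcome = per_patient.corr()["outcome"].drop("outcome")

# Keep features whose absolute correlation exceeds the threshold
# (using the absolute value is an assumption of this sketch).
selected = corr_with_outcome[corr_with_outcome.abs() > 0.5]
print(selected)
```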

Precise analysis of important attributes

Parallel Coordinates Plot

A parallel coordinates plot illustrating the dependencies between the blood features whose correlation with the outcome is higher than 0.7.

Thanks to the parallel coordinates plot, we can easily isolate patients with a certain range of attribute values. When a range is selected on an axis, only the patients within that range are highlighted. To reset the range, double-click the selected axis.


Heat map

Having analyzed the markers for COVID-19 detection, we decided to examine the correlation between them and the outcome, and then to check how age affects the values of the blood features.

The influence of markers on the outcome

Based on the previous correlation heat map, we decided to examine the influence of high sensitivity C-reactive protein (HSC) on the outcome in relation to age and gender.

In a healthy person the concentration is low and does not exceed 5 mg/l, but in COVID-19 patients HSC increases strongly.

For patients in the 0-50 mg/l range, the mortality rate is only ~0.15. However, if we select a higher HSC range, the mortality rate increases significantly.

By clicking on the red / green bar, we can easily separate the recovered patients from those who died, and observe that mortality increases drastically with age and with HSC concentration.
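The range selection made on the interactive chart can also be reproduced numerically. The sketch below assumes pandas, an `hs_crp` column in mg/l and an `outcome` column where 1 means the patient died; the helper `mortality_rate` and the toy values are hypothetical and do not reproduce the ~0.15 figure from the real data.

```python
import pandas as pd

# Toy data; 'hs_crp' (mg/l) and 'outcome' (1 = died) are assumed column names.
patients = pd.DataFrame({
    "hs_crp":  [3.0, 20.0, 45.0, 49.0, 70.0, 150.0, 200.0, 90.0],
    "outcome": [0, 0, 0, 1, 1, 1, 1, 0],
})

def mortality_rate(df, low, high):
    """Share of patients who died among those with low <= hs_crp <= high."""
    in_range = df[df["hs_crp"].between(low, high)]
    return in_range["outcome"].mean()

low_range = mortality_rate(patients, 0, 50)
high_range = mortality_rate(patients, 50, 300)
print(low_range, high_range)
```

Because the outcome is coded as 0/1, its mean within a group is directly the mortality rate of that group.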


How are the markers correlated?

When a group of patients is selected on one graph, the same patients are also highlighted on the other graphs, so we can observe all parameters of the selected group.

Considering all the graphs, the greatest spread in values can be observed for lymphocytes. Nevertheless, there is a clear trend of rising mortality as lymphocyte levels decline.

For lactate dehydrogenase and HSC, there is a threshold above which only a few patients recovered: lactate dehydrogenase ~300 [IU/l], HSC ~50 [mg/l].

Common to all graphs is the trend of mortality increasing with patient age.


Other features

For the next plot we want to see how the levels of thrombocytocrit and serum sodium affect the probability of not surviving COVID-19 at different stages of life, and how this differs between the sexes. To do this, we add a column holding each patient's stage of life.
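Adding the stage-of-life column could look roughly like this. The sketch assumes pandas; the bin edges and the "middle-aged" label are hypothetical, since the report only names the young, adult and senior categories and does not state the exact cut-offs.

```python
import pandas as pd

patients = pd.DataFrame({"age": [22, 35, 40, 59, 60, 81]})

# Hypothetical bin edges and labels; the report does not give the real ones.
bins = [0, 25, 40, 60, 120]
labels = ["young", "adult", "middle-aged", "senior"]

# right=False makes each bin include its left edge: [0, 25), [25, 40), ...
patients["life_stage"] = pd.cut(
    patients["age"], bins=bins, labels=labels, right=False
)
print(patients)
```

The resulting categorical column can then be used to split a scatter plot into one facet per stage of life.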

Facet Scatter plot

In the chart above we can observe a few things.

First of all, there are not many people under 40 (the young and adult categories). This may mean that people at these stages of life have a lower probability of severe disease and of needing hospitalization, or that in the early days of the pandemic (the samples were collected between 2020-01-10 and 2020-02-18) there was a greater need to take care of older people.

Secondly, seniors with a lower thrombocytocrit (the share of blood volume occupied by thrombocytes) are in the high-risk group.

Thirdly, all people whose serum sodium was well above the normal limit (145 mmol/l) died. This is only about 30 people.


Normal Scatter plot

On this plot we can observe that a very large share of the blood samples from people whose albumin and calcium levels were both below the normal expected values (3.5 g/dl and 2.1 mmol/l respectively, marked by the horizontal and vertical lines) belong to the high-risk group.
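The same observation can be checked without the plot by flagging the samples that fall below both reference lines. This is a sketch assuming pandas and hypothetical column names; the thresholds are the ones quoted above, and the toy values are illustrative only.

```python
import pandas as pd

ALBUMIN_MIN = 3.5  # g/dl, lower limit of the normal range quoted in the text
CALCIUM_MIN = 2.1  # mmol/l, lower limit of the normal range quoted in the text

# Toy samples; column names and values are assumptions for this sketch.
samples = pd.DataFrame({
    "albumin": [3.0, 3.2, 4.1, 3.9, 2.8, 4.3],
    "calcium": [1.9, 2.0, 2.3, 2.0, 1.8, 2.4],
    "outcome": [1, 1, 0, 0, 1, 0],  # 1 = died
})

# True only for samples in the lower-left quadrant of the scatter plot.
samples["both_low"] = (
    (samples["albumin"] < ALBUMIN_MIN) & (samples["calcium"] < CALCIUM_MIN)
)

# Mortality rate inside and outside that quadrant.
rate_by_group = samples.groupby("both_low")["outcome"].mean()
print(rate_by_group)
```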


Violin plot

Now we want to see how the distribution of prothrombin activity in the population relates to the mortality rate, and how it depends on sex and age.

From the chart we can see that prothrombin activity does not depend on gender. Another thing we can observe is that all people whose activity of this protein factor is below 60% of the norm died. Normal values range from about 70% to 130%, and around half of the people who died fall within this interval, so this factor should not be used as a deciding criterion on its own.


Checking hypothesis

At the end of Selecting correlated features we stated the hypothesis that, of all white cells, only the neutrophil count matters.

It is clearly visible that lymphocytes and monocytes have no impact on the outcome. The values of eosinophils and basophils are too small and too similar to draw any conclusion. Only for neutrophils are the values large enough, and sufficiently distinguishable between outcomes, to say that their amount is important for health.
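One simple way to back such a comparison with numbers is to compare a summary statistic of each cell count between the two outcome groups. The sketch below assumes pandas and uses the median; the column names and toy values are hypothetical and are not results from the report.

```python
import pandas as pd

# Toy cell counts; names and values are illustrative only.
counts = pd.DataFrame({
    "outcome":           [0, 0, 0, 1, 1, 1],  # 1 = died
    "neutrophils_count": [3.0, 4.0, 3.5, 9.0, 11.0, 10.0],
    "monocytes_count":   [0.4, 0.5, 0.6, 0.5, 0.4, 0.6],
})

# Median of each cell count per outcome group.
medians = counts.groupby("outcome").median()

# Absolute gap between the two groups for every cell type.
gap = (medians.loc[1] - medians.loc[0]).abs()
print(gap)
```

A large gap relative to the spread within each group points to a cell type that separates the outcomes; a gap near zero does not.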


Summary


Background:

At the end of 2019 in Wuhan, China, the SARS-CoV-2 virus was identified. On 30 January 2020 the WHO declared a Public Health Emergency of International Concern regarding COVID-19. On 11 March a global pandemic was announced. Many researchers from all over the world started doing everything they could to help others fight the common enemy. Among them is the research group that published this article on 14 May 2020. Now, almost 1 year later, we, young students of Artificial Intelligence at Poznan University of Technology, are facing the challenge of preparing a reproducible, standalone HTML report in Python containing at least 6 visualizations.

Main Part:

At the beginning we get a file containing blood test data for 375 patients, with basic information about each patient and the outcome of their illness. We preprocess the data a little and choose the features that have the highest correlation with the outcome. Then we take a closer look at a few of them. For example, we create a heat map showing how lactate dehydrogenase, high sensitivity C-reactive protein (HSC), the percentage of lymphocytes, age, gender and outcome correlate with each other. Then we look at the influence of HSC on the outcome, taking age and gender into consideration. Next we compare the previous plot with similar ones that use a different main attribute. Last but not least, we look for something interesting in the rest of the data. We start with a comparison of how thrombocytocrit and serum sodium are distributed at different stages of life. Later we take a closer look at how the levels of albumin and calcium affect (or are affected by?) COVID-19. Next we compare the density of prothrombin activity in the patients' blood and look for any interesting conclusions. The grand finale: a comparison of the amounts of several white cell types and their impact on the possible outcome.

Conclusions:

Making plots is fun